{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(text-regex)=\n",
    "# Regular Expressions, aka regex\n",
    "\n",
    "This chapter covers how to regular expressions to perform processing of text.\n",
    "\n",
    "Regex, aka regular expressions, provide a way to both search and change text. Their advantages are that they are concise, they run very quickly, they can be ported across languages (they are definitely not just a Python thing!), and they are very powerful. The disadvantage is that they are confusing and take some getting used to!\n",
    "\n",
    "![](https://imgs.xkcd.com/comics/regular_expressions.png)\n",
    "\n",
    "You can live code regex in a couple of places, the first is within Visual Studio Code itself. Do this by clicking the magnifying glass in the left-hand side panel of options. When the search strip appears, you can put a search term in. To the right of the text entry box, there are three buttons, one of which is a period (full stop) followed by an asterisk. This option allows the Visual Studio text search function to accept regular expressions. This will apply regex to all of the text in your current Visual Studio workspace.\n",
    "\n",
    "Another approach is to head over to [https://regex101.com/](https://regex101.com/) or [https://regexr.com/](https://regexr.com/) and begin typing your regular expression there (regexr's cheat sheets and reference patterns are well worth checking out too). You will need to add some text in the box for the regex to be applied to.\n",
    "\n",
    "Try either of the above with the regex `string \\w+\\s`. This matches any occurrence of the word 'string' that is followed by another word and then a whitespace. As an example, 'string cleaning ' would be picked up as a match when using this regex.\n",
    "\n",
    "Within Python, the `re` library provides support for regular expressions. Let's try it:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "text = \"It is true that string cleaning is a topic in this chapter. string editing is another.\"\n",
    "re.findall(\"string \\w+\\s\", text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`re.findall` returns all matches. There are several useful search-like functions in `re` to be aware of that have a similar syntax of `re.function(regex, text)`. The table shows what they all do\n",
    "\n",
    "\n",
    "| Function     | What it does                                                    | Example of use                              | Output for given value of `text`                                      |\n",
    "|--------------|-----------------------------------------------------------------|---------------------------------------------|-----------------------------------------------------------------------|\n",
    "| `re.match`   | Declares whether there is a match at the beginning of a string. | `re.match(\"string \\w+\\s\" , text) is True`  | `None`                               |\n",
    "| `re.search`  | Declares whether there is a match anywhere in the string.       | `re.search(\"string \\w+\\s\" , text) is True` | `True`                                                                |\n",
    "| `re.findall` | Returns all matches.                                            | `re.findall(\"string \\w+\\s\" , text)`         | `['string cleaning ', 'string editing ']`                             |\n",
    "| `re.split`   | Splits text wherever a match occurs.                            | `re.split(\"string \\w+\\s\" , text)`           | `['It is true that ', 'is a topic in this chapter. ', 'is another.']` |\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Another really handy regex function is `re.sub`, which substitutes one bit of text for another if it finds a match. Here's an example:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "new_text = \"new text here! \"\n",
    "re.sub(\"string \\w+\\s\", new_text, text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Special Characters\n",
    "\n",
    "So far, we've only seen a very simple application of regex involving a vanilla word, `string`, the code for another word `\\w+` and the code for a whitespace `\\s`. Let's take a more comprehensive look at the regex special characters:\n",
    "\n",
    "| Character | Description                                                 | Example Text                           | Example Regex         | Example Match Text  |\n",
    "|-----------|--------------------------------------------------------|----------------------------------------|-----------------------|---------------------|\n",
    "| \\d        | One Unicode digit in any script                        | \"file_93 is open\"                      | `file_\\d\\d`           | \"file_93\"           |\n",
    "| \\w        | \"word character\": Unicode letter, digit, or underscore | \"blah hello-word blah\"                 | `\\w-\\w`               | \"hello-world\"       |\n",
    "| \\s        | \"whitespace character\": any Unicode separator          | \"these are some words with spaces\"     | `words\\swith\\sspaces` | \"words with spaces\" |\n",
    "| \\D        | Non-digit character (opposite of \\d)                   | \"ABC 10323982328\"                    | `\\D\\D\\D`              | \"ABC\"               |\n",
    "| \\W        | Non-word character (opposite of \\w)                    | \"Once upon a time *\"                   | `\\W`                  | \"*\"                 |\n",
    "| \\S        | Non-whitespace character (opposite of \\s)              | \"y        \"                            | `\\S`                  | \"y\"                 |\n",
    "| \\Z        | End of string                                          | \"End of a string\"                      | `\\w+\\Z`               | \"string\"\"           |            |\n",
    "| .        | Match any character except the newline          | \"ab=def\"                               | `ab.def`            | \"ab=def\"                |\n",
    "\n",
    "\n",
    "Note that whitespace characters include newlines, `\\n`, and tabs, `\\t`.\n",
    "\n",
    "## Quantifiers\n",
    "\n",
    "As well as these special characters, there are quantifiers which ask for more than one occurence of a character. For example, in the above, `\\w\\w` asked for two word characters, while `\\d\\d` asks for two digits. The next table shows all of the quantifiers.\n",
    "\n",
    "| Quantifier | Role                                       | Example Text               | Example Regex | Example Match      |\n",
    "|------------|--------------------------------------------|----------------------------|---------------|--------------------|\n",
    "| {m}        | Exactly m repetitions                      | \"936 and 42 are the codes\" | `\\d{3}`       | \"936\"              |\n",
    "| {m,n}      | From m (default 0) to n (default infinity) | \"Words up to four letters\" | `\\b\\w{1,4}\\b` | \"up\", \"to\", \"four\" |\n",
    "| *          | 0 or more. Same as {,}                     | \"42 is the code\"           | `\\d*\\s`       | \"42\"               |\n",
    "| +          | 1 or more. Same as {1,}                    | \"4 323 hello\"              | `\\d+`         | \"4\", \"323\"         |\n",
    "| ?          | Optional, so 0 or 1. Same as {,1}.                       | \"4 323 hello\"              | `\\d?\\s`       | \"4\"                |\n",
    "\n",
    "```{admonition} Exercise\n",
    "Find a single regex that will pick out only the percentage numbers from both \"Inflation in year 3 was 2 percent\" and \"Interest rates were as high as 12 percent\". \n",
    "```\n",
    "\n",
    "## Metacharacters\n",
    "\n",
    "Now, as well as special characters and quantifiers, we can have meta-character matches. These are not characters *per se*, but starts, ends, and other bits of words. For example, `\\b` matches strings at a word (`\\w+`) boundary, so if we took the text \"Three letter words only are captured\" and ran `\\b\\w\\w\\w\\b` we would return \"are\". `\\B` matches strings not at word (`\\w+`) boundaries so the text \"Bricks\" with `\\B\\w\\w\\B` applied would yield \"ri\". The next table contains some useful metacharacters.\n",
    "\n",
    "| Metacharacter Sequence | Meaning                       | Example Regex | Example Match                                                                |\n",
    "|------------------------|-------------------------------|--------------------|------------------------------------------------------------------------------|\n",
    "| ^                      | Start of string or line       | `^abc`               | \"abc\" (appearing at start of string or line)                                 |\n",
    "| $                      | End of string, or end of line | `xyz$`               | \"xyz\" (appearing at end of string or line)                                   |\n",
    "| \\b                     | Match string at word (\\w+) boundary                 | `ing\\b`              | \"match**ing**\" (matches ing if it is at the end of a word)                   |\n",
    "| \\B                     | Match string not at word (\\w+) boundary              | `\\Bing\\B`            | \"st**ing**er\" (matches ing if it is not at the beginning or end of the word) |\n",
    "\n",
    "Because so many characters have special meaning in regex, if you want to look for, say, a dollar sign or a dot, you need to escape the character first with a backward slash. So `\\${1}\\d+` would look for a single dollar sign followed by some digits and would pick up the '\\$50' in 'she made \\$50 dollars'.\n",
    "\n",
    "```{admonition} Exercise\n",
    "Find the regex that will pick out only the first instance of the word 'money' and any word subsequent to 'money' from the following: \"money supply has grown considerably. money demand has not kept up.\". \n",
    "```\n",
    "\n",
    "## Ranges\n",
    "\n",
    "You probably think you're done with regex, but not so fast! There are more metacharacters to come. This time, they will represent *ranges* of characters.\n",
    "\n",
    "| Metacharacter Sequence | Description       | Example Expression | Example Match    |\n",
    "|------------------------|---------------------------------------------------------|--------------------|-----------------------------------|\n",
    "| \\[characters\\]           | The characters inside the brackets are part of a matching-character set  | `[abcd]`             | a, b, c, d, abcd     |\n",
    "| \\[^...\\]   | Characters inside brackets are a non-matching set; a character not inside is a matching character. | `[^abcd]`            | Any occurrence of any character EXCEPT a, b, c, d. |\n",
    "| \\[character-character\\]  | Any character in the range between two characters (inclusive) is part of the set  | `[a-z]`   | Any lowercase letter    |\n",
    "| \\[^character\\]           | Any character that is not the listed character     | `[^A]`      | Any character EXCEPT capital A     |\n",
    "\n",
    "Ranges have two more neat tricks. The first is that they can be concatenated. For example, `[a-c-1-5]` would match any of a, b, c, 1, 2, 3, 4, 5. They can also be modified with a quantifier, so `[a-c0-2]{2}` would match \"a0\" and \"ab\".\n",
    "\n",
    "\n",
    "## Greedy versus lazy regexes\n",
    "\n",
    "Buckle up, because this one is a bit tricky to grasp. Adding a `?` after a regex will make it go from being 'greedy' to being 'lazy'. Greedy means that you will match the longest possible string that hits the condition. Lazy will mean that you get the shortest possible string matching the condition. It's easiest to demonstrate with an example:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_string = \"stackoverflow\"\n",
    "greedy_regex = \"s.*o\"\n",
    "lazy_regex = \"s.*?o\"\n",
    "\n",
    "print(f\"The greedy match is {re.findall(greedy_regex, test_string)[0]}\")\n",
    "print(f\"The lazy match is {re.findall(lazy_regex, test_string)[0]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In the former (greedy) case, we get from an 's' all the way to the last 'o' within the same word. In the latter (lazy) case we just get everything between the start and first occurrence of an 'o'. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Matches versus capture groups\n",
    "\n",
    "There is often a difference between what you might want to match and what you actually want to *grab* with your regex. Let's say, for example, we're parsing some text and we want any numbers that follow the format '$xx.xx', where the 'x' are numbers but we don't want the dollar sign. To do this, we can create a *capture group* using brackets. Here's an example:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"Product 1 was $45.34, while product 2 came in at $50.00 however it was assessed that the $4.66 difference did not make up for the higher quality of product 2.\"\n",
    "re.findall(\"\\$(\\d{2}.\\d{2})\", text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's pick apart the regex here. First, we asked for a literal dollar sign using `\\$`. Next, we opened up a capture group with `(`. Then we said only give us the numbers that are 2 digits, a period, and another 2 digits (thus excluding \\$4.66). Finally, we closed the capture group with `)`.\n",
    "\n",
    "So while we specify a *match* using regex, while only want running the regex to return the *capture group*.\n",
    "\n",
    "Let's see a more complicated example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sal_r_per = r\"\\b([0-9]{1,6}(?:\\.)?(?:[0-9]{1,2})?(?:\\s?-\\s?|\\s?to\\s?)[0-9]{1,6}(?:\\.)?(?:[0-9]{1,2})?)(?:\\s?per)\\b\"\n",
    "text = \"This job pays gbp 30500.00 to 35000 per year. Apply at number 100 per the below address.\"\n",
    "re.findall(sal_r_per, text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, the regex first looks for up to 6 digits, then optionally a period, then optionally another couple of digits, then either a dash or 'to' using the '|' operator (which means or), followed by a similar number, followed by 'per'.\n",
    "\n",
    "But the capture group is only the subset of the match that is the number range-we discard most of the rest. Note also that other numbers, even if they are followed by 'per', are not picked up. `(?:)` begins a *non-capture group*, which matches only but does not capture, so that although `(?:\\s?per)` looks for \" per\" after a salary (with the space optional due to the second `?`), it does not get returned.\n",
    "\n",
    "```{admonition} Exercise\n",
    "Find a regex that captures the wage range from \"Salary Pay in range $9.00 - $12.02 but you must start at 8.00 - 8.30 every morning.\". \n",
    "```\n",
    "\n",
    "This has been a whirlwind tour of regexes. Although regex looks a lot like gobbledygook, it is a really useful tool to be able to deploy for more complex string cleaning and extraction tasks."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
